/*
THIS FILE USES NHIS DATA TO EXAMINE THE IMPACTS OF DACA ON HEALTH AND ECONOMIC OUTCOMES

DATE: January 8, 2020

IMPORTANT NOTE!! This file reflects some changes that were brought to our attention by an outside research group, who wish to remain anonymous.
We are grateful to them for doing so.
The substantive research design and specifics of the code assigning treatment have NOT changed.
Estimated effects are actually slightly STRONGER as a result of these changes. The specific change in question has to do with the construction of
the main outcome (psychological distress).

The previous version of this file summed in one measure that was not part of the Kessler-6, introducing noise into this measure.
Correctly calculating the Kessler 6 does not change any of the substantive findings (i.e., original Poisson IRR on mental health 0.78; this version 0.72; original OR on moderate distress: 0.62, this version 0.50).

Comments welcome; please let us know (avenkataramani [AT] partners [DOT] org) if you find any errors. 

POLICY DETAILS: http://www.immigrationpolicy.org/issues/DREAM-Act
DACA eligibles were under age 31 on June 15, 2012, came to the US before age 16, and needed to have been in the country for at least 5 years.
They also had to be enrolled in school, have completed HS/GED, or have been honorably discharged from the armed forces.

DATA SOURCES: The analysis uses data from the 2000-2015 National Health Interview Survey (NHIS)
from the Minnesota Population Center's Integrated Public Use Microdata Series (IPUMS).
See here: https://ihis.ipums.org/.
Our extract was from May 2016. Extracts may change slightly as the source data are
updated over time. This may explain any small differences in point estimates, sample sizes, etc. that
arise when you run our code on your own extract.

The do file "$nhis/nhis_setup_ipums_20002015.do" describes the exact sample, time period, and variables extracted
to aid replication. 

OUTCOMES: Self-reported health and the Kessler 6 (K6) scale. In a prior version of the paper we mistakenly labelled the K6
as the Patient Health Questionnaire 9 (PHQ-9). This will be updated in the published paper with an accompanying
erratum to flag the error.

APPROACH: Based on prior analysis from the ACS, we focus on a very tight control group in a difference-in-differences setup.
I.e., we focus on individuals of Hispanic ethnicity with a high school education (or better) who have lived in the US for at least 5 years.
We make this choice because of differential trends in SES over time by education, differences in the SES of more recent immigrants, and the fact that DACA beneficiaries were
predominantly of Hispanic ethnicity. Thus our identification is based on individuals who are otherwise similar but who meet DACA criteria based on age at policy and timing
of immigration.

CHALLENGES: Timing of immigration is difficult to ascertain because the NHIS public use data provide binned values (5-10, 10-15, 15+ years). For individuals under age 31, those in the
first bin would be DACA ineligible, those in the second bin potentially DACA eligible, and those in the third bin ineligible. Based on ACS data, we use the midpoint of each bin to assign age
at immigration, knowing that this will be noisy (and thereby bias our estimates downward). For those above 31 at the time of the policy, none would be eligible. We control for age at policy FE
and timing of immigration bin FE in all of our models. It is possible to estimate from the ACS a regression that would assign age of immigration to each individual. We do not include
this in our published paper, but we show below that doing so does not change our results.

THIS FILE'S SETUP: The first part of the file sets up our data. It creates a data set from the IPUMS NHIS using the do file nhis_setup_ipums_20002015.do. This file provides the dictionary for
the *.dat extract that we obtained. The second part of the file cleans the variables and defines exposure. The third estimates our models.

*/

**FILE PATHS - DEFINE YOUR OWN
*here we have one for the source data, the other for output
*user can define paths as they see fit

global nhis "[ADD YOUR FILEPATH HERE]/nhis/"
global output "[ADD YOUR FILEPATH HERE]/output/"

**RUN NHIS SETUP
run "$nhis/nhis_setup_ipums_20002015.do"
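
*Optional check (a sketch added to aid replication; not part of the original analysis).
*It confirms that the raw variables used later in this file exist in your extract.
*The list is assembled from the commands in this file and assumes your extract uses
*the same IPUMS NHIS variable names.
foreach v in year age sex region intervwmo birthmo birthyr hispeth racea marst educrec ///
	incfam97on2 hinotcove health regionbr yrsinus citizen ///
	aeffort afeelint1mo aworthless perweight sampweight psu strata {
	capture confirm variable `v', exact
	if _rc display as error "Variable `v' was not found in the extract"
}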

**CLEAN VARIABLES AND DEFINE ESTIMATION SAMPLE

**Restrict sample
keep if year>2007
keep if age<51

**Define demographic variables (some of which are NOT used in this analysis)
recode hispeth (10 = 0) (20/max = 1)
gen race2 = racea
recode race2 (100 = 0) (200 = 1) (310 = .) (410/499 = 3) (500/max = .)
replace race2 = 2 if hispeth==1

label define racel 0 "White" 1 "Black" 2 "Hispanic" 3 "Asian"
label values race2 racel

gen married = marst
recode married (0=.) (11/12 = 1) (13/max = 0)

recode educrec (90/max = .) (0=.)
gen yrschl = educrec - 1
recode yrschl (13 = 14) (14/15 = 16)
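*Note: a missing educrec compares as larger than any number, so these indicators equal 1 when educrec is missing
mark highschool if educrec>=12
mark somecollege if educrec>=13
mark college if educrec>=15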

gen inccat = incfam97on2 
recode inccat (10=1) (20=2) (31=3) (32 = 4) (98/max = 5)

gen insurance = hinotcove
recode insurance (2=1) (1=0) (3 = 0)

**MENTAL HEALTH
/*Create Kessler 6 Scale
Based on published literature, will also define a discrete cutoff to denote clinically relevant
moderate distress
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370145/
*/

/*this is the extra variable added to the K6 index that should not have been added*/
drop afeelint1mo

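*Set codes of 6 and above (not in universe/unknown) to missing for each K6 item
foreach x of varlist aeffort-aworthless {
	recode `x' (6/max = .)
}
*Sum the items; rowtotal() (formerly rsum()) treats missing items as 0
egen k6_index = rowtotal(aeffort-aworthless)
*Set the index to missing when the first item is missing
recode k6_index (0=.) if aeffort==.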

gen mod_dist = k6_index
recode mod_dist (0/4 = 0) (5/max = 1)
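
*Optional check (a sketch; not part of the original analysis). The K6 sums six items,
*so, assuming each item is scored 0-4 as in the standard K6, k6_index should lie between
*0 and 24. This also assumes that aeffort-aworthless spans exactly the six K6 items once
*afeelint1mo has been dropped; the item list is displayed so that you can verify it.
quietly ds aeffort-aworthless
local k6items `r(varlist)'
local nk6 : word count `k6items'
display "Items in k6_index (`nk6'): `k6items'"
summarize k6_index
tabulate mod_dist, missing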

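**SELF-REPORTED (PHYSICAL) HEALTH
*Set unknown codes to missing, then reverse the scale so that higher values mean better health
recode health (7/max = .)
recode health (1=5) (2=4) (4=2) (5=1)
*poor_health flags the two lowest categories; mark codes it 0 when health is missing
mark poor_health if health<=2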

**DEFINE NATIVITY AND IMMIGRATION - KEY TO DEFINE EXPOSURE
/*We first create an indicator of whether the individual immigrated before age 16.
The assigned bin values reflect analysis that we did using the ACS, which provides continuous values: the value
assigned to each bin follows the actual mean of the corresponding bin in the ACS.
We subtract the binned value from age to calculate age of immigration, and from it the indicator imm16, as below.
Our simplistic way of binning should further make our estimates a lower bound, since at the margin it
misassigns some individuals to tx and some to control.
*/

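gen foreign_born = regionbr
recode foreign_born (1=0) (2/11 = 1) (99 = .)
gen imm16 = .
*Assign each yrsinus bin a single value for years in the US
gen temp = yrsinus
recode temp (0=.) (1=1) (2 = 3) (3 = 7.5) (4 = 12.5) (5 = 22)

*US-born respondents are coded imm16 = 1; the foreign born are coded by imputed age at arrival
recode imm16 (.=1) if foreign_born==0
recode imm16 (.=1) if age - temp <= 16
recode imm16 (.=0) if age - temp > 16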

gen non_citizen = citizen
recode non_citizen (2 = 0) (7/9 = 0)

gen age_imm = age - temp
recode age_imm (-3/-1 = 0)
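
*Worked example of the imputation above (a sketch; not part of the original analysis).
*A 25-year-old assigned temp = 12.5 has age - temp = 12.5, so imm16 = 1; the same person
*assigned temp = 7.5 has age - temp = 17.5, so imm16 = 0.
*The tabulations below only display the resulting assignments.
tabulate yrsinus temp, missing
tabulate imm16 foreign_born, missing
summarize age_imm if foreign_born==1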

**DEFINE POLICY TIMING (BY INTERVIEW MONTH AND YEAR, UNLIKE IN ASEC)
gen yr_mo = year + (intervwmo - 1)/12
mark post if yr_mo>2012.5
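
*Worked example of the timing rule (a sketch; not part of the original analysis).
*Assuming intervwmo runs from 1 (January) to 12 (December), an interview in July 2012 has
*yr_mo = 2012 + 6/12 = 2012.5, which is not above the cutoff, so post = 0. An interview in
*August 2012 has yr_mo of about 2012.58, so post = 1.
display 2012 + (7 - 1)/12
display 2012 + (8 - 1)/12
tabulate year post, missing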

recode birthmo (13/max = .)
recode birthyr (2016/max = .)

gen age_pol = 2012 - (year - age)

/* Below is code to assign age based on birth month and year, imputing for those individuals we don't have birth year on. Yields the same results
gen age_pol = 2012.5 - ((birthyr + ((birthmo-1)/12)))
replace age_pol = 2012.5 - (yr_mo - age) if age_pol==.
replace age_pol = floor(age_pol)
*/

gen age_cat = age_pol
recode age_cat (0/10 = .) (11/15 = 1) (16/20 = 2) (21/25 = 3) (26/31 = 4) (32/35 = 5) (36/40 = 6) (41/45 = 7) (46/max = 8)

***IDENTIFY ELIGIBLES
gen elig1 = 0
recode elig1 (0=1) if imm16==1&non_citizen==1&age_pol<=31

*Exposure term for DD model
gen post_elig1 = post*elig1
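
*Optional check (a sketch; not part of the original analysis): display the four
*difference-in-differences cells and the age-at-policy range among those flagged eligible.
tabulate elig1 post, missing
tabulate post_elig1, missing
summarize age_pol if elig1==1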

***RESULTS 

log using "$output/descriptives"

*Set sample restriction, controls
global sample "non_citizen==1&hispeth==1&yrsinus>2&age>18&educ>=13&year>=2008"
global controls "i.age_pol i.temp i.sex i.region i.year*i.intervwmo"

**MEANS/SD OF MAIN SAMPLE
*Run one model on the main sample so that e(sample) identifies the observations used in estimation
xi: reghdfe health i.post*elig1 $controls if $sample, abs(sex year intervwmo  region)
mark sample if e(sample)
global outcomes1 "health poor_health"
global outcomes2 "k6_index mod_dist"

tab intervwmo, gen(month)
tab region, gen(reg)

*unweighted
sum $outcomes1 $outcomes2 if sample==1
bysort elig1: sum $outcomes1 $outcomes2 age age_imm sex month* reg* if sample==1
bysort elig1: sum $outcomes1 $outcomes2 age age_imm sex month* reg* if post==0&sample==1
bysort elig1: sum $outcomes1 $outcomes2 age age_imm sex month* reg* if post==1&sample==1

*weighted (alternatively, this can be done with SVY, which gives equivalent answers; see the notes on this below)
sum $outcomes1  if sample==1 [aw = perweight]
sum $outcomes2  if sample==1 [aw = sampweight]

bysort elig1: sum $outcomes1 age age_imm sex month* reg* if sample==1 [aw = perweight]
bysort elig1: sum $outcomes1  age age_imm sex month* reg* if post==0&sample==1 [aw = perweight]
bysort elig1: sum $outcomes1  age age_imm sex month* reg* if post==1&sample==1 [aw = perweight]

bysort elig1: sum $outcomes2 age age_imm sex month* reg* if sample==1 [aw = sampweight]
bysort elig1: sum $outcomes2  age age_imm sex month* reg* if post==0&sample==1 [aw = sampweight]
bysort elig1: sum $outcomes2  age age_imm sex month* reg* if post==1&sample==1 [aw = sampweight]

log close

***REGRESSIONS
/*Note: we follow the IPUMS NHIS guidelines when using weights. In particular, perweight (PERSON WEIGHT) is to be
used for analyses involving variables that were collected for all individuals. sampweight (SAMPLE WEIGHT) is to be used
for analyses where data were collected on a random subsample.
See: https://ihis.ipums.org/ihis/userNotes_weights.shtml

Also note that we do not use SVY commands here because some strata contain a single unit, which does not allow us to
calculate S.E.s. We show below, though, that our [pw = X] method and svy produce virtually the same point estimates and S.E.s.
*/
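
*Illustration (a sketch; not one of the paper's models): an unadjusted 2x2
*difference-in-differences for self-reported health on the main sample. The
*coefficient on 1.post#1.elig1 plays the same role as post_elig1 in the models
*below, but without the fixed effects, so its value will differ.
regress health i.post##i.elig1 if $sample [pw = perweight], vce(robust)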

**TABLE 2
log using "$output/main_regressions"

xi: poisson k6_index post elig1 post_elig1 $controls if $sample [pw = sampweight], robust irr
xi: reg health post elig1 post_elig1 $controls if $sample [pw = perweight], robust

xi: logit poor_health post elig1 post_elig1 $controls if $sample  [pw = perweight], robust or
xi: logit mod_dist post elig1 post_elig1 $controls if $sample [pw = sampweight], robust or

log close

**TABLE 3
/*Falsification - less than HS
Note - the model for k6_index does not converge easily, so we replace the age_pol FE with a binned version (age_cat, roughly 5-year age bins), which allows convergence.
The OLS and unweighted models yield similar (null) point estimates, justifying this choice.
*/

global samplehs "non_citizen==1&hispeth==1&yrsinus>2&age>18&educ<13&year>=2008"

log using "$output/robustness_checks"

xi: poisson k6_index post elig1 post_elig1 i.age_cat i.temp i.sex i.region i.year i.intervwmo  if $samplehs [pw = sampweight], robust irr
xi: reg health post elig1 post_elig1 $controls if $samplehs [pw = perweight], robust

xi: logit poor_health post elig1 post_elig1 $controls if $samplehs [pw = perweight], robust or
xi: logit mod_dist post elig1 post_elig1 $controls if $samplehs [pw = sampweight], robust or

*2010 onwards
global samplerecent "non_citizen==1&hispeth==1&yrsinus>2&age>18&educ>=13&year>=2010"

xi: poisson k6_index post elig1 post_elig1 $controls if $samplerecent [pw = sampweight], robust irr
xi: reg health post elig1 post_elig1 $controls if $samplerecent [pw = perweight], robust

xi: logit poor_health post elig1 post_elig1 $controls if $samplerecent  [pw = perweight], robust or
xi: logit mod_dist post elig1 post_elig1 $controls if $samplerecent [pw = sampweight], robust or

*Young and recent (age 40 or younger at the time of the policy)
global sampleyoung "non_citizen==1&hispeth==1&educ>=13&year>=2010&age>=19&age_pol<=40&yrsinus>2"
global controlsyoung "i.age_pol i.temp i.sex i.region i.year*i.intervwmo"

xi: poisson k6_index post elig1 post_elig1 $controlsyoung if $sampleyoung [pw = sampweight], robust irr
xi: reg health post elig1 post_elig1 $controlsyoung if $sampleyoung [pw = perweight], robust

xi: logit poor_health post elig1 post_elig1 $controlsyoung if $sampleyoung [pw = perweight], robust or
xi: logit mod_dist post elig1 post_elig1 $controlsyoung if $sampleyoung [pw = sampweight], robust or

log close

***SUPPLEMENTAL APPENDIX MATERIALS - UNWEIGHTED
*See text for discussion on why weighted models are correct for these data

log using "$output/appendix"

*Main
xi: poisson k6_index post elig1 post_elig1 $controls if $sample , robust irr
xi: reg health post elig1 post_elig1 $controls if $sample , robust

xi: logit poor_health post elig1 post_elig1 $controls if $sample  , robust or
xi: logit mod_dist post elig1 post_elig1 $controls if $sample , robust or

*Recent
xi: poisson k6_index post elig1 post_elig1 $controls if $samplerecent , robust irr
xi: reg health post elig1 post_elig1 $controls if $samplerecent , robust

xi: logit poor_health post elig1 post_elig1 $controls if $samplerecent  , robust or
xi: logit mod_dist post elig1 post_elig1 $controls if $samplerecent , robust or

*Young and recent
xi: poisson k6_index post elig1 post_elig1 $controlsyoung if $sampleyoung , robust irr
xi: reg health post elig1 post_elig1 $controlsyoung if $sampleyoung , robust

xi: logit poor_health post elig1 post_elig1 $controlsyoung if $sampleyoung , robust or
xi: logit mod_dist post elig1 post_elig1 $controlsyoung if $sampleyoung , robust or

log close

***IMPORTANT ADDITIONAL MATERIALS!***

**WHY DON'T WE USE SVY?
/*Strata containing a single sampling unit make it impossible to estimate S.E.s with svy.
When we eliminate these single-unit strata, S.E.s and point estimates are virtually identical to those from our [pw = X] approach.
Below is an example
(svyset as recommended by IPUMS NHIS - https://ihis.ipums.org/ihis/userNotes_variance.shtml)
*/

svyset psu [pweight=sampweight], strata(strata)

xi: svy: poisson k6_index post elig1 post_elig1 $controls if $sample&k6_index~=. , irr

*Identify observations in single-unit strata and re-run the model without them
svydes if e(sample), single gen(single)

xi: svy: poisson k6_index post elig1 post_elig1 $controls if $sample&k6_index~=.&single==0 , irr
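
*Alternative (a sketch; not used in the paper): rather than dropping single-unit strata,
*svyset can be told how to treat them through its singleunit() option. Here
*singleunit(centered) centers those strata at the grand mean when computing variances.
*Whether this is appropriate for your extract is an assumption to check.
svyset psu [pweight=sampweight], strata(strata) singleunit(centered)
xi: svy: poisson k6_index post elig1 post_elig1 $controls if $sample&k6_index~=. , irr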
